Minimum Information Loss Cluster Analysis for Categorical Data

Authors

  • Jiří Grim
  • Jan Hora
Abstract

The EM algorithm has been used repeatedly to identify latent classes in categorical data by estimating finite distribution mixtures of product components. Unfortunately, the underlying mixtures are not uniquely identifiable and, moreover, the estimated mixture parameters depend on the starting point. For this reason we use the latent class model only to define a set of “elementary” classes by estimating a mixture with a large number of components. We propose a hierarchical “bottom-up” cluster analysis that sequentially unifies the elementary latent classes. The clustering procedure is controlled by a minimum information loss criterion.
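The bottom-up merging step can be illustrated with a minimal sketch. The abstract does not state the exact form of the information loss criterion, so the code below assumes one plausible reading: the loss of a merge is the drop in the empirical mutual information I(X; C) between the data and the class label, computed from the posterior probabilities q(c | x_n) that an EM fit would supply. The function names (`mutual_information`, `merge_columns`, `agglomerate`) and the toy posterior matrix are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def mutual_information(post):
    """Empirical I(X; C) = H(C) - H(C|X) from rows of posteriors q(c | x_n)."""
    n = len(post)
    k = len(post[0])
    # Class weights are the average posterior of each class over the data.
    weights = [sum(row[c] for row in post) / n for c in range(k)]
    h_c = -sum(w * math.log(w) for w in weights if w > 0)
    h_c_given_x = -sum(q * math.log(q) for row in post for q in row if q > 0) / n
    return h_c - h_c_given_x


def merge_columns(post, a, b):
    """Unify classes a and b by summing their posterior columns."""
    merged = []
    for row in post:
        new_row = [q for c, q in enumerate(row) if c not in (a, b)]
        new_row.append(row[a] + row[b])  # merged class goes last
        merged.append(new_row)
    return merged


def agglomerate(post, target):
    """Greedily merge the pair with the smallest loss until `target` classes remain."""
    groups = [[c] for c in range(len(post[0]))]
    history = []
    while len(groups) > target:
        base = mutual_information(post)
        best = None
        for a, b in combinations(range(len(groups)), 2):
            loss = base - mutual_information(merge_columns(post, a, b))
            if best is None or loss < best[0]:
                best = (loss, a, b)
        loss, a, b = best
        post = merge_columns(post, a, b)
        new_group = groups[a] + groups[b]
        groups = [g for i, g in enumerate(groups) if i not in (a, b)] + [new_group]
        history.append((loss, new_group))
    return groups, history


# Toy posteriors (assumed, not real data): elementary classes 0 and 1 are
# indistinguishable on the data, and so are classes 2 and 3.
toy_post = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]
groups, history = agglomerate(toy_post, target=2)
print(sorted(sorted(g) for g in groups))
```

In this toy case, unifying two elementary classes that the data cannot tell apart costs essentially no information, so they are merged first. The pairwise search is quadratic in the number of classes at each step, which is acceptable for a sketch but may need pruning for a large number of elementary components.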

Similar resources

A Clustering Algorithm for Categorical Data Based on Combining Measures

Clustering is one of the main techniques in data mining. Clustering is a process that partitions a data set into groups. In clustering, the data within a cluster are as similar to each other as possible, and the data in different clusters are as different as possible. Clustering algorithms are divided into two categories according to the type of data: Clustering algorithms for numerical data and clustering algor...


On Enhancing Data Utility in K-anonymization for Data without Hierarchical Taxonomies

K-anonymity is a model that is widely used to protect the privacy of individuals when publishing microdata. It can be defined as clustering with the constraint of a minimum of k tuples in each group. K-anonymity reduces the linking confidence between sensitive information and a specific individual to the ratio 1/k. However, the accuracy of the data in a k-anonymous dataset decreases due to informatio...


Minimum Error Classification Clustering

Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. In this paper, we study the problem of clustering categorical data, where data objects are made up of non-numerical attributes. We propose MECC (Minimum Error Classification Clustering), an alternative technique for categ...


A Fast K-prototypes Algorithm Using Partial Distance Computation

K-means is one of the most popular and widely used clustering algorithms; however, it is limited to numeric data. The k-prototypes algorithm is one of the best-known algorithms for handling both numeric and categorical data. However, there have been no studies on accelerating the k-prototypes algorithm. In this paper, we propose a new fast k-prototypes algorithm that gives the same answer as ...


Integrative Parameter-Free Clustering of Data with Mixed Type Attributes

Integrative mining of heterogeneous data is one of the major challenges for data mining in the next decade. We address the problem of integrative clustering of data with mixed-type attributes. Most existing solutions suffer from one or both of the following drawbacks: either they require input parameters that are difficult to estimate, or they do not adequately support mixed type attribute...




Publication date: 2007